In this report, firstly, all data sets were cleaned and then a comprehensive summary with statistics was presented for each of them. Then, interesting relationships between World Development Indicators were explored. Correlations between most of the indicators were presented for general overview. Subsequently, more specific correlations between GDP per capita were shown, with visualizations. For example, association between GDP per capita and life expectancy was inspected. Afterwards, correlations of World Development Indicators with gold prices were analyzed, as preparation for creating a gold price prediction model. Lastly, using gathered knowledge, a regression model was created, which predicts gold prices with sufficient accuracy.
Following libraries and dependencies were used in the report:
library(knitr)
library(kableExtra)
library(DT)
library(tidyverse)
library(readxl)
library(janitor)
library(lubridate)
library(skimr)
library(corrr)
library(ggplot2)
library(ggthemes)
library(gganimate)
library(ggcorrplot)
library(plotly)
library(caret)
library(missForest)
library(countrycode)
The input data includes information on the economic indicators and development of each country as measured by over 200 statistics. In addition, it includes information on currency exchange rates, gold prices, Bitcoin trading, and the monthly performance of the S&P Composite.
The data was collected by various institutions, notably the World Bank and Nasdaq.
# https://data.worldbank.org/
wdi_df <- read_excel('./data/World_Development_Indicators.xlsx',
na = c('', '..')) %>% head(-5)
currency_df <- read_csv('./data/CurrencyExchangeRates.csv',
col_types = cols(Date = col_date("%Y-%m-%d")))
# https://data.nasdaq.com/data/LBMA/GOLD-gold-price-london-fixing
gold_prices_df <- read_csv('./data/Gold prices.csv')
# https://data.nasdaq.com/data/YALE/SPCOMP-sp-composite
sp_df <- read_csv('./data/S&P Composite.csv')
# https://data.nasdaq.com/data/BCHAIN-blockchain
btc_diff_df <- read_csv('./data/bitcoin/BCHAIN-DIFF.csv')
btc_hrate_df <- read_csv('./data/bitcoin/BCHAIN-HRATE.csv')
btc_mkpru_df <- read_csv('./data/bitcoin/BCHAIN-MKPRU.csv')
btc_trvou_df <- read_csv('./data/bitcoin/BCHAIN-TRVOU.csv')
While cleaning each data set, decisions described below were made.
.. values were interpreted as NA values. Last 5 rows were also dropped, because they contained data set metadata, such as last update date.Country Code and Series Code columns were dropped, as they were not useful in further analysisCountry Name and Series Name were pivoted into longer format; names of columns were stored into Year columnSeries Name columnCountry Name column was mutated into column of factor type and Year column was mutated into column of numeric typesnake_casewdi_cleaned_df <- wdi_df %>%
select(-c(`Country Code`, `Series Code`)) %>%
pivot_longer(-c(`Country Name`, `Series Name`), names_to = 'Year') %>%
pivot_wider(names_from = `Series Name`) %>%
mutate(`Country Name` = as_factor(`Country Name`),
Year = as.numeric(word(Year)))
wdi_names <- names(wdi_cleaned_df) %>%
set_names(make_clean_names(.)) %>%
as.list
wdi_cleaned_df <- set_names(wdi_cleaned_df, names(wdi_names))
Date column was parsed as column of date typeDate were pivoted into longer format; names of columns were stored into currency column, values were stored into exchange_rate columnNA values were droppedsnake_casecurrency column was mutated into column of factor typecurrency_df <- currency_df %>%
pivot_longer(cols = -Date, names_to = 'currency',
values_to = 'exchange_rate') %>%
drop_na() %>%
clean_names() %>%
mutate(currency = as_factor(currency))
snake_casegold_prices_df <- clean_names(gold_prices_df)
snake_caseyear column was renamed to datesp_df <- clean_names(sp_df) %>% rename(date = year)
snake_casevalue_diff to difficultyvalue_hrate to hash_ratevalue_mkpru to market_price_usdvalue_trvou to exchange_trade_volume_usdbtc_diff_df <- clean_names(btc_diff_df)
btc_hrate_df <- clean_names(btc_hrate_df)
btc_mkpru_df <- clean_names(btc_mkpru_df)
btc_trvou_df <- clean_names(btc_trvou_df)
btc_df <- btc_diff_df %>%
full_join(btc_hrate_df, by = 'date', suffix = c('_diff', '_hrate')) %>%
full_join(btc_mkpru_df, by = 'date') %>%
full_join(btc_trvou_df, by = 'date', suffix = c('_mkpru', '_trvou')) %>%
rename(difficulty = value_diff,
hash_rate = value_hrate,
market_price_usd = value_mkpru,
exchange_trade_volume_usd = value_trvou)
In this section, summary with basic statistics for each of the data sets and their variables was presented.
tribble(
~Table, ~`No. of columns (attributes)`, ~`No. of rows (observations)`,
'World Development Indicators', ncol(wdi_cleaned_df), nrow(wdi_cleaned_df),
'Currency Exchange Rates', ncol(currency_df), nrow(currency_df),
'Gold prices', ncol(gold_prices_df), nrow(gold_prices_df),
'S&P Composite', ncol(sp_df), nrow(sp_df),
'Bitcoin', ncol(btc_df), nrow(btc_df)
) %>% kbl() %>% kable_styling()
| Table | No. of columns (attributes) | No. of rows (observations) |
|---|---|---|
| World Development Indicators | 215 | 10608 |
| Currency Exchange Rates | 3 | 243689 |
| Gold prices | 7 | 13585 |
| S&P Composite | 10 | 1810 |
| Bitcoin | 5 | 4661 |
report_table <- function(df, round_digits=2) {
datatable(df, style = 'bootstrap', rownames = FALSE,
options = list(scrollX = TRUE)) %>%
formatRound(names(select_if(df, is.numeric)), round_digits)
}
wdi_skim_df <- wdi_cleaned_df %>%
set_names(wdi_names) %>%
skim()
wdi_skim_df %>%
yank('factor') %>%
clean_names(case = 'title') %>%
report_table
wdi_skim_df %>%
yank('numeric') %>%
clean_names(case = 'title') %>%
report_table()
currency_df %>%
clean_names(case = 'title') %>%
skim() %>%
clean_names(case = 'title') %>%
report_table()
A closer look at the data in Currency Exchange Rates data set reveals ambiguity whether currency exchange rates are from given currency to U.S. Dollar or vice versa. By looking at exchange rate for Polish Zloty, we can deduce that exchange rates in the data set are from U.S. Dollar to Polish Zloty. On the other side, looking at exchange rates of Euro or U.K. Pound Sterling suggests that the conversion in these cases is reverse; it is common knowledge that Euro and U.K. Pound Sterling are more valuable currencies than U.S. Dollar, so their exchange rate should be below 1.0, but it is above 1.0. Therefore, whole data set was discarded from further analysis.
currency_df %>%
filter(currency %in% c('Polish Zloty', 'U.K. Pound Sterling', 'Euro',
'U.S. Dollar')) %>%
group_by(date) %>%
filter(all(c('Polish Zloty', 'U.K. Pound Sterling', 'Euro', 'U.S. Dollar')
%in% currency)) %>%
tail(4) %>%
clean_names(case = 'title') %>%
kbl() %>% kable_styling()
| Date | Currency | Exchange Rate |
|---|---|---|
| 2018-04-30 | Euro | 1.2079 |
| 2018-04-30 | Polish Zloty | 3.4868 |
| 2018-04-30 | U.K. Pound Sterling | 1.3725 |
| 2018-04-30 | U.S. Dollar | 1.0000 |
gold_prices_df %>%
clean_names(case = 'title', abbreviations = c('USD', 'GBP', 'AM', 'PM')) %>%
skim() %>%
clean_names(case = 'title') %>%
report_table()
sp_df %>%
clean_names(case = 'title', abbreviations = c('CPI')) %>%
rename(`S&P Composite` = `S p Composite`) %>%
skim() %>%
clean_names(case = 'title') %>%
report_table()
btc_df %>%
clean_names(case = 'title', abbreviations = c('USD')) %>%
skim() %>%
clean_names(case = 'title') %>%
report_table()
Correlations between World Development Indicators were presented on the interactive correlation heat map below. Only variables with more than 50% complete rate (less than 50% of NA values) were selected. Correlation was calculated using the pairwise.complete.obs option which ensures that only complete pairs of observations of variables were used.
Hover over cells in the heat map to show correlated variables names and their correlation values.
wdi_na_filtered_df <- wdi_cleaned_df %>%
set_names(wdi_names) %>%
select(where(~mean(is.na(.)) < 0.50))
wdi_cor_matrix <- wdi_na_filtered_df %>%
select(3:last_col()) %>%
cor(use = 'pairwise.complete.obs')
wdi_cor_matrix[!lower.tri(wdi_cor_matrix)] <- NA
corr_plot <- wdi_cor_matrix %>%
ggcorrplot() +
labs(x = 'Variable 1', y = 'Variable 2') +
theme_classic() +
theme(axis.text = element_blank(), axis.ticks = element_blank())
ggplotly(corr_plot)
Below the heat map, table with all correlations in range (0.6, 0.9) was also included. Pair of variables with correlation higher than 0.9 were excluded, because they showed obvious relationships between variables, like percentage of urban and rural population or percentage of male and female population.
wdi_cor_matrix %>%
as_cordf() %>%
stretch() %>%
filter(abs(r) > 0.6, abs(r) < 0.9) %>%
arrange(desc(abs(r))) %>%
rename(X = x, Y = y, Correlation = r) %>%
report_table(round_digits = 5)
Per capita gross domestic product (GDP) measures a countryâs economic output per person and is calculated by dividing the GDP of a country by its population. Per capita GDP is a global measure used for gauging the prosperity of nations.
wdi_map_df <- wdi_cleaned_df %>%
filter(!country_name %in% c('Channel Islands', 'Kosovo',
'Low & middle income', 'Low income',
'Lower middle income', 'Middle income', 'World',
'Upper middle income', 'High income'))
wdi_map_df %>%
rename(`GDP per capita (USD)` = gdp_per_capita_current_us) %>%
plot_geo(locationmode = 'country names') %>%
add_trace(z = ~`GDP per capita (USD)`, zmin = 0, zmax = 70000,
color = ~`GDP per capita (USD)`,
frame = ~year,
colors = 'Blues',
text = ~country_name,
locations = ~country_name,
marker = list(line = list(color = toRGB("grey"), width = 0.5))) %>%
colorbar(title = 'GDP per capita (USD)') %>%
layout(title = 'GDP per capita over years',
geo = list(scope = 'world',
projection = list(type = 'natural earth'),
countrycolor = toRGB('grey'),
showcoastlines = TRUE)) %>%
animation_opts(redraw = FALSE) %>%
plotly_build()
Preston curve is an empirical relationship between life expectancy and GDP per capita. It indicates that individuals born in richer countries, on average, can expect to live longer than those born in poor countries. The first plot is a reproduction of the Preston curve for year 2019 and countries with GDP per capita below 70000 USD.
wdi_continents_df <- wdi_map_df %>%
add_column(continent = countrycode(sourcevar = .$country_name,
origin = "country.name",
destination = "continent")) %>%
relocate(continent)
preston_curve_plot <- wdi_continents_df %>%
filter(year == 2019, gdp_per_capita_current_us < 70000) %>%
ggplot(aes(x = gdp_per_capita_current_us, y = life_expectancy_at_birth_total_years)) +
geom_point(aes(color = continent,
text = paste('Country:', country_name,
'<br>Continent:', continent,
'<br>GDP per capita (USD):', round(gdp_per_capita_current_us),
'<br>Life expectancy (years):', round(life_expectancy_at_birth_total_years, digits = 2)))) +
labs(x = 'GDP per capita (USD)',
y = 'Life expectancy (years)',
color = "Continent") +
geom_smooth(se = FALSE)
ggplotly(preston_curve_plot, tooltip = 'text')
The second plot presents how GDP per capita and life expectancy changed over years in countries grouped by continents. It also shows that in general, countries with higher GDP tend to have a higher life expectancy. It shows that with passing years standard of living is improving across the world as well.
wdi_continents_df %>%
ggplot(aes(gdp_per_capita_current_us,
life_expectancy_at_birth_total_years,
size = population_total,
color = continent)) +
geom_point(alpha = 0.7) +
scale_size_continuous(name = "Population",
breaks = c(1e9, 1e8, 1e7, 1e6),
limits = c(1e6, 1e9),
range = c(1, 8)) +
scale_x_log10(labels = scales::comma) +
theme_bw(base_size = 8) +
labs(title = "Year: {round(frame_time)}",
x = 'GDP per capita (USD)',
y = 'Life expectancy (years)',
color = "Continent",
caption = "") +
transition_time(year) +
ease_aes('linear')
High positive correlation can be immediately deduced by observing the animation, which is also confirmed by the following map plot.
wdi_life_expectancy_cor_df <- wdi_map_df %>%
group_by(country_name) %>%
summarize(correlation = cor(gdp_per_capita_current_us,
life_expectancy_at_birth_total_years,
use = 'pairwise.complete.obs'))
wdi_life_expectancy_cor_df %>%
plot_geo(locationmode = 'country names') %>%
add_trace(z = ~correlation, zmin = -1, zmax = 1,
color = ~correlation,
colors = c("blue", "white", "red"),
text = ~country_name,
locations = ~country_name,
marker = list(line = list(color = toRGB("grey"), width = 0.5))) %>%
colorbar(title = 'Correlation') %>%
layout(title = 'GDP per capita (USD) - life expectancy (years) correlation',
geo = list(scope = 'world',
projection = list(type = 'natural earth'),
countrycolor = toRGB('grey'),
showframe = TRUE,
showcoastlines = TRUE)) %>%
plotly_build()
Standard of living is the material well being of the average person in a given population. Life expectancy increases with the standard of living. GDP per capita is a good approximation for standard of living, which explains why there is a high correlation between GDP per capita and life expectancy.
To validate that decrease in per capita GDP is associated with an increase in infant mortality rate, as found by the authors of Aggregate Income Shocks and Infant Mortality in the Developing World paper, the variables change over years 1999-2015 was presented in selected countries.
infmort_plot <- wdi_continents_df %>%
filter(country_name %in% c("United States", "Tonga", "Colombia", "Grenada",
"Sri Lanka", "Malta", "Germany", "Japan", "Sweden",
"Netherlands"),
year %in% 1999:2015) %>%
rename(`GDP per capita (USD)` = gdp_per_capita_current_us,
`Infant mortality rate (per 1,000 live births)` =
mortality_rate_infant_per_1_000_live_births,
Continent = continent,
Country = country_name) %>%
ggplot(aes(x = `GDP per capita (USD)`,
y = `Infant mortality rate (per 1,000 live births)`,
color = Continent,
tooltip = Country)) +
geom_point() +
facet_wrap(~year)
ggplotly(infmort_plot)
Once again, high correlation can be deduced from the plots, but this time it is negative, which is confirmed by the following map plot.
wdi_mortality_rate_cor_df <- wdi_map_df %>%
group_by(country_name) %>%
summarize(correlation = cor(gdp_per_capita_current_us,
mortality_rate_infant_per_1_000_live_births,
use = 'pairwise.complete.obs'))
wdi_mortality_rate_cor_df %>%
plot_geo(locationmode = 'country names') %>%
add_trace(z = ~correlation, zmin = -1, zmax = 1,
color = ~correlation,
colors = c("blue", "white", "red"),
text = ~country_name,
locations = ~country_name,
marker = list(line = list(color = toRGB("grey"), width = 0.5))) %>%
colorbar(title = 'Correlation') %>%
layout(title = 'GDP per capita (USD) - mortality rate, infant (per 1,000 live births) correlation',
geo = list(scope = 'world',
projection = list(type = 'natural earth'),
countrycolor = toRGB('grey'),
showframe = TRUE,
showcoastlines = TRUE)) %>%
plotly_build()
To explore correlation between gold prices and global development indicators (for example whole worldâs GDP), several operations on data sets were made:
yearly_gold_prices_df data frame
gold_price_usd column was added, which is a mean of gold USD prices in the morning and in the afternoonNA values of gold_price_usd column were droppedyearly_sp_df data frame
world_joined_df data frame
country_name equal to "World" was selected from data set which only includes columns with less than 50% of NA valuesyear column with yearly_gold_prices_df data set from operation 1. was performed. This way only observations for years with known gold price were preservedyear column with yearly_sp_df data set from operation 2. was performed. This way no observations with known gold price were discardedBitcoin prices data set was discarded from further analysis, because it didnât contain data before year 2009 - Bitcoin didnât exist before that date. Including it into resulting joined data set would result in more than 50% observations with NA values for Bitcoin price variable.
yearly_gold_prices_df <- gold_prices_df %>%
mutate(gold_price_usd = (usd_am + usd_pm) / 2) %>%
drop_na(gold_price_usd) %>%
group_by(year = year(date)) %>%
summarize(gold_price_usd = mean(gold_price_usd))
yearly_sp_df <- sp_df %>%
group_by(year = year(date)) %>%
summarize(sp_composite = mean(s_p_composite))
world_joined_df <- wdi_na_filtered_df %>%
filter(`Country Name` == 'World') %>%
inner_join(yearly_gold_prices_df, by = c('Year' = 'year')) %>%
left_join(yearly_sp_df, by = c('Year' = 'year')) %>%
rename(`Gold price (USD)` = gold_price_usd,
`S&P Composite` = sp_composite)
In the table below, all absolute correlations with gold price higher than 0.6 were presented.
world_cor <- world_joined_df %>%
select(3:last_col()) %>%
cor(use = 'pairwise.complete.obs') %>%
remove_empty()
world_gold_cor <- world_cor %>%
as_cordf() %>%
focus(`Gold price (USD)`) %>%
arrange(desc(abs(`Gold price (USD)`))) %>%
filter(abs(`Gold price (USD)`) > 0.6) %>%
rename(`Gold price (USD) correlation` = `Gold price (USD)`,
Variable = term)
report_table(world_gold_cor, round_digits = 5)
Since in the table of gold correlations with other variables there is more than 50 variables with absolute correlation higher than 0.6, caret::findCorrelation function with cutoff parameter set to 0.9 was used to reduce pair-wise correlations.
vars_to_be_removed <- world_cor %>% findCorrelation()
world_gold_cor_subset <- world_joined_df %>%
select(3:last_col()) %>%
select(!all_of(vars_to_be_removed)) %>%
correlate() %>%
focus(`Gold price (USD)`) %>%
arrange(desc(abs(`Gold price (USD)`))) %>%
filter(abs(`Gold price (USD)`) > 0.6) %>%
rename(`Gold price (USD) correlation` = `Gold price (USD)`,
Variable = term)
correlated_variables <- world_gold_cor_subset %>% pull(Variable)
Most significant correlations were presented on the graph below.
world_gold_cor_plot <- world_gold_cor_subset %>%
ggplot(aes(x = reorder(Variable, `Gold price (USD) correlation`),
y = `Gold price (USD) correlation`,
text = paste('Variable:', Variable,
'<br>Gold price (USD) correlation:',
round(`Gold price (USD) correlation`, digits = 5)))) +
geom_col() +
ylim(-1, 1) +
labs(x = '') +
coord_flip() +
theme_minimal()
ggplotly(world_gold_cor_plot, tooltip = 'text')
Before training the model, there is a need to handle NA values in the data set in some way.
Missing values were imputed by using missForest library, which is based on Random Forest algorithm. missForest is one of the most accurate imputation methods, but at the cost of slow processing time. In this case, slow processing time is not a problem, because input data set is relatively small.
miss_forest <- world_joined_df %>%
select(3:last_col()) %>%
as.data.frame() %>%
missForest(verbose = TRUE)
## removed variable(s) 10 11 12 30 31 38 40 49 50 51 53 68 70 96 due to the missingness of all entries
## missForest iteration 1 in progress...done!
## estimated error(s): 0.04191611
## difference(s): 0.007450749
## time: 1.921 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.02928387
## difference(s): 0.0008458608
## time: 1.936 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.02754529
## difference(s): 4.07246e-05
## time: 1.94 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.03082877
## difference(s): 1.441841e-05
## time: 1.904 seconds
##
## missForest iteration 5 in progress...done!
## estimated error(s): 0.03182396
## difference(s): 4.347241e-06
## time: 1.914 seconds
##
## missForest iteration 6 in progress...done!
## estimated error(s): 0.02896512
## difference(s): 1.279036e-05
## time: 1.885 seconds
print(miss_forest$OOBerror)
## NRMSE
## 0.03182396
Data set was split into training and testing partitions in 75/25 ratio. A validation set was created using repeated cross-validation with number of divisions equal to 2 and number of repetitions equal to 5.
imputed_df <- miss_forest$ximp %>%
select(all_of(correlated_variables)) %>%
cbind(Year = world_joined_df$Year,
`Gold price (USD)` = world_joined_df$`Gold price (USD)`)
train_index <- createDataPartition(imputed_df$`Gold price (USD)`,
p = 0.75,
list = FALSE)
train_data <- imputed_df[train_index,]
test_data <- imputed_df[-train_index,]
ctrl <- trainControl(method = "repeatedcv",
number = 2,
repeats = 5)
The following graph shows the similarity of the distributions of the training and test data. Since input data consists only of 51 observations, there is some discrepancy between training and testing data partitions.
ggplot() +
geom_density(aes(`Gold price (USD)`, fill = "Train"), train_data, alpha = 0.6) +
geom_density(aes(`Gold price (USD)`, fill = "Test"), test_data, alpha = 0.6) +
labs(x = "Gold price (USD)", y = "Density", fill = "Partition data set")
Random Forest regression model was used to predict the gold prices.
fit <- train(`Gold price (USD)` ~ . - Year,
data = train_data,
method = "rf",
trControl = ctrl)
fit
## Random Forest
##
## 39 samples
## 26 predictors
##
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 5 times)
## Summary of sample sizes: 20, 19, 20, 19, 19, 20, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 161.0114 0.9310910 102.3224
## 13 164.0964 0.9276829 104.2045
## 25 165.5569 0.9265937 106.4599
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
Two measures, R2 and RMSE, were used to assess prediction accuracy.
R2 measures the strength of the relationship between the model and the dependent variable on a 0 â 100% scale.
Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. RMSE is a good measure of how accurately the model predicts the response.
prediction <- predict(fit, test_data)
post_resample <- postResample(pred = prediction,
obs = test_data$`Gold price (USD)`)
post_resample
## RMSE Rsquared MAE
## 58.3185252 0.9866305 51.8092988
R2 value of 0.9866305 and RMSE value of 58.3185252 mean that model is accurate.
The following graph shows the test set values and the model output values.
prediction_comparison_df <- tibble(year = test_data$Year,
actual = test_data$`Gold price (USD)`,
predicted = prediction)
ggplot(prediction_comparison_df, aes(x = year)) +
geom_line(aes(y = actual, color = "Test data set")) +
geom_line(aes(y = predicted, color = "Predicted")) +
labs(color = "Values", x = "Year", y = "Gold price (USD)")
ggplot(varImp(fit))
By looking at the above plot, it turned out that pure economic indicators, such as GDP, GDP per capita and gross national expenditure were the most important in predicting gold prices.
Following them were environment-related variables, such as CO2 emissions from liquid fuel consumption.
Lastly, another important group of indicators was related to population. The reason for importance of these variables might be the fact that population is growing faster than supply of gold. It can be expected that a significant percentage of population will be willing to purchase jewelry or electronic products, whose manufacture uses gold.